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Abstract 


This paper addresses the problem of neighborhood selection for Gaussian graphi¬ 
cal models. We present two heuristic algorithms: a forward-backward greedy al¬ 
gorithm for general Gaussian graphical models based on mutual information test, 
and a threshold-based algorithm for walk summable Gaussian graphical models. 
Both algorithms are shown to be structurally consistent, and efficient. Numerical 
results show that both algorithms work very well. 


1 Introduction 

Gaussian graphical model is a very powerful statistical tool in many fields. By encoding the condi¬ 
tional dependency structure of different variables into the structure of a sparse graph, it helps reveal 
simple structure beneath high dimensional, and often complicated observed data. Eor example, in 
bioinformatics, Gaussian graphical models are often used to represent the Markovian dependence 
structure among a vast pool of genes, based on observations on gene expression data. In other fields, 
Guassian graphical models can be used for spatial interpolation. On a graph where nodes represents 
the geological locations and edges represents direct impact, Gaussian graphical models can be used 
to infer data at unobserved locations when data for a small number of locations are available. In 
geostatistics, Gaussian graphical models with small neighborhood size can be used to approximate 
much denser Gaussian random fields, which greatly reduces the computational complexity due to the 
availability of fast computation techniques for sparse matrices M- Meanwhile, Gaussian graphical 
models also play an important role in belief propagation, and the related applications such as error 
control coding 02. 

Eor many applications of Gaussian graphical models, estimating the graphical structure, whose ad¬ 
jacency matrix is essentially the support of the inverse covariance matrix of the joint Gaussian dis¬ 
tribution, has been a very interesting topic. Generally speaking, most effort approaches the problem 
from two different aspect, either to estimate the inverse covariance matrix as a whole, or to combine 
the estimation of the entries of the neighborhood of each individual node. Two typical examples are 
Graphical Lasso (GLasso) 0, and Neighborhood Lasso (NLasso) lfT3l . respectively. 

1.1 Related Work 

Recently, there has been several more efficient methods that recover the structure of the neighbor¬ 
hood (or the entire graphical structure) greedily. In ifTOl . Johnson et.al proposed a forward-backward 
greedy method in estimating the neighborhood entries of each node. The forward part of the method 
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greedily selects the node that essentially maximizes the mutual information between the node of 
interest and the estimated neighborhood. Since no theoretical guarantees can be provided that such 
selection would always be correct unless the size of the neighborhood is 1 , the backward pruning 
algorithm is designed to prune potentially false neighbors. Theoretically, the authors proved that 
under certain conditions, with the most important one being the restricted eigenvalue condition, the 
forward-backward algorithm is structurally consistent, with a sample complexity at the state of the 
art level, superior to NLasso. Another interesting work is m, in which the authors showed that if 
the Gaussian graphical model is walk summable, and the number of paths whose lengths are at most 
7 between a node and any of its non-neighbors is restricted to be smaller than a fixed number 77 , then 
the conditional mutual information between a node and a non-neighbor is upper bounded, while 
the conditional mutual information between a node and a neighbor can be lower bounded, when 
the set conditioned on includes at least one node for each distinct path of at most length 7 between 
the node and its non-neighbor. Hence, computing all the conditional mutual information between 
two nodes where the set conditioned on has at most cardinality p, and applying a threshold would 
succeed in finding the neighborhood, asympotically. Finally, the authors showed that under further 
assumptions, the test is structurally consistent for almost every graph as the graph size approaches 
infinity. 

Even more recently, there has been a line of research that focuses on lower bounding the maximal 
influence between a node and its undiscovered neighbors, or certain information distance between 
two graphs if they differ by at least one edge. In 13, Bresler showed that for Ising models, if the 
neighborhood hasn’t been completely discovered, then the maximal influence for a node from one 
of its undiscovered neighbors can be lower bounded by a constant away from 0. Hence, thresholding 
the influence repeatedly would guarantee the selection of all neighbors. Moreover, the total number 
of nodes selected can be upper bounded by a quantity independent of the graph size. Hence pruning 
non-neighbors can be done efficiently. For Gaussian graphical models. Jog et.al showed in ID that 
if two Gaussian graphical models differ by at least one edge, then the KL divergence between the 
joint distributions of the two models is bounded away from 0. Hence, if we are given a set of sparse 
candidate graphs, one can simply obtain the solution by maximizing the likelihood. 

1.2 Our contribution 

In this work, we follow the footsteps of the research effort mentioned above. In particular, we focus 
on two questions. (1) Can we greedily select the neighborhood of each node based on an information 
theoretic measurement? (2) Does there exist a natural way of measuring the influence between two 
nodes, and can the maximum influence between a node and its undiscovered neighbors be bounded 
away from zero at any time? 

We give affirmative answers to both questions. For the first question, we develop a forward- 
backward type greedy algorithm, which selects the neighborhood of each node iteratively. In each 
round, the algorithm picks a new neighbor that maximize the conditional mutual information be¬ 
tween two nodes conditioned on the already selected pseudo neighborhood, and prunes all the least 
likely neighbors, until the conditional mutual information between the node and potential new neigh¬ 
bors is below a threshold. For this algorithm, we show its structural consistency, while the efficiency 
is demonstrated numerically. For the second question, we show that, for walk summable Gaussian 
graphical models, one can always lower bound the absolute value of the maximal conditional co- 
variance between a node and its undiscovered neighbor by a constant. This property enables us to 
design efficient thresholding and pruning algorithms, which first selects a pseudo neighborhood that 
contains all true neighbors with high probability, and then prunes non-neighbors efficiently. Both 
the performance and efficiency of the algorithm are characterized theoretically and numerically. 

The rest of the paper is organized as follows. In Section |2] we introduce the general set of nota¬ 
tion we adopt, and introduce the preliminaries to Gaussian graphical models. The category of walk 
summable Gaussian graphical models is also introduced, with more details included in Section|3 In 
Section 13 we propose the greedy neighborhood selection algorithm, and prove its structural consis¬ 
tency. In Section m we propose the thresholding algorithm, and analyze its efficiency, correctness. 
The pruning algorithm is also given in Section H) and the overall structural consistency and sample 
complexity are derived. In addition, we also show in Section|4]that our assumptions required for the 
algorithm are not restrictive in the sense that they do not prohibit the graph to scale up to size in¬ 
finity. In Section|3we demonstrate the performance of our algorithm, and compare it to benchmark 
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algorithms. The conclusions are then presented, and the proofs as well as the detailed introduction 
on walk summability is included in Section]?] 


2 Preliminaries 

2.1 Gaussian graphical models 

Throughout the paper, the following set of notation is commonly used. We denote the underlying 
Gaussian graphical model by (V, E), where = {1,..., n} with index i corresponding to the i-th 
dimension of the joint distribution. The covariance and inverse covariance matrices are denoted by 
S and J, respectively. The empirical covariance matrix is denoted by S. For either S or J, we 
use Es^t or Js^t to denote the submatrices obtained by picking the intersections of the rows and 
columns with indices in sets S and T, respectively. We assume that S and T are ordered sets, i.e., 
if S' = {s(l), s(|S|)} contains more than 1 element, then s{i) < s{j) for i < j. This rigorously 

specifies the way of writing down the submatrix, and permits us to refer to the i-th element of S, 
which corresponds to a unique node in the graph, without raising any confusions. When referring to 
an element instead of a submatrix, we simply write or Jij where i and j are the indices of the 
nodes involved. For a given graph, we denote the neighbors of a node i by Afi, and let Afi be all the 
non-neighbors of i. The estimated set of neighbors for node i is denoted as Si. In addition, for any 
set S, we denote its complement by 5'“. 

Given an n-dimensional Gaussian random vector X = (Xi,..., X„) £ R”, there are two common 
ways to write the probability density function ^(x). The first way is to write ^{x) in the covariance 
form, characterized by the mean vector m = ]E[X] and the covariance matrix E = ]E[(X — m)^]. 
The second way is to write ii{X) in the information form, characterized by the information matrix 
(also known as the precision matrix or the inverse covariance matrix) J = E“^ and the potential 
vector h = E“^m = Jm. More specifically, 

X'^JX + h^xX. ( 1 ) 


The information matrix J £ is symmetric and positive definite. It encodes the conditional 

dependency structure between different dimensions of X. The most well known result is that for 
any i,j £ V, Jij = 0 if and only if Xi _L XjlXy^j^i jy 

Furthermore, for any S C V, denote := V\S, then it is known that 

Js,s = Ss,S|S= = - Ss.S'=S5JscEs (2) 

“ {Js,s ~ 50/5^50) . ( 3 ) 

These two equations present two common operations used for Gaussian graphical models: condi¬ 
tioning, and marginalization. When marginalization is performed (equation (O), a new set of edges 
is introduced into the subgraph with vertexes S by the additive term Jj gc. If we wish to 

preserve the original graphical structure over vertexes S, then the operation needed is conditioning, 
as indicated in equation (|2]i. 

When building the thresholding algorithm, we consider a category of Gaussian graphical models 
with strong intuition, the walk summable Gaussian graphical models. For the purpose of analysis, it 
is sufficient to know that such category of Gaussian graphical models is parameterized by a £ (0,1), 

and a Gaussian graphical model is said to be a walk summable if |||/ — '/D J^/IJ |||2 < a, 
where D is the diagonal matrix of J, and || • II 2 is the spectral norm. The strong intuition of a walk 
summable Gaussian graphical models is that the covariance E^j can be represented as the summation 
of walk weights along the edges of the graph from iXo j. A detailed introduction on walk summable 
Gaussian graphical models can be found in Appendix l?.!! 


Im{X) cx exp ^{X — m)\ cx exp 
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2.2 Information theoretic qnantities 


The mutual information between Xi and Xj and the conditional mutual information between Xi 
and Xj conditioned on a set of random variables Xs are defined as 


I(Xf,Xj)=E 


log 


fXi, 


Xi 


fXifXj, 


I{X,;X,\Xs)=E 


log 


fXi, 


Xj\Xs 


fXi\XsfXj\Xs. 


where / denotes the probability density function and the expectations are with respect to the joint 
distributions. For a set of jointly Gaussian random variables, it is known that HI 

I{Xi-Xj) = ^log \ , I {Xi-X:j\Xs) = \\og- - ^ -, (4) 

^ ^ ^ ^ PiJlS 


where pij and Pijjs the correlation coefficient between Xi and Xj and the conditional cor¬ 
relation coefficient given Xs, respectively. Recall the definition of the conditional correlation 
coefficient; Pijis = '^ij\s/\/^ii\s'^jj\S■ Hence, the empirical mutual information is given by 


i{Xi-,X,\Xs) := ilog 


SjilsSjjIs 
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2.3 Forward-Backward algorithm 

Zhang introduced a forward-backward greedy algorithm for sparse linear regression that begins with 
an empty set of active variables and gradually adds and removes variables to the active set flTl . The 
algorithm has two steps: the forward step and the backward step. In the forward step, the algorithm 
finds the best next candidate that minimizes a loss function and adds it to the active set as long as 
the improvement is greater than a certain threshold. In the backward step, the algorithm checks 
the influence of those variables that are already collected and if the contribution of some variables 
in reducing the loss function is less than a certain threshold, the algorithm removes them from the 
active set. The algorithm is required to repeat the backward step until no such nodes are found. By 
choosing an appropriate threshold, the algorithm terminates in a finite number of steps. Jalali et al. 
and Johnson et al. utilized this method in conjunction with a Gaussian log-likelihood function to 
learn the underlying structure of spare GMRFs and proved the consistency of their algorithm for 
sparse Gaussian models ISl fTOll . 

We also adopt the forward backward method in one of our learning algorithms. The algorithm at 
the forward step adds the best candidate to the active set of a node of interest based on a conditional 
mutual information test. In the backward step, the algorithm removes all the unlikely neighbors from 
the active set at one shot unlike the aforementioned related work, which do so at intermediate steps. 
The algorithm repeats the two steps to estimate the neighborhoods of all the nodes in the graph one 
node at a time. 

Next section presents the motivations behind our two-step approach using the geometric interpreta¬ 
tion of the conditional mutual information test. 


2.4 Geometric representation 

For zero-mean Gaussian random variables, the conditional mutual information between Xi and Xj 
given Xs is related to the minimum distance between the rejection vectors of Xi and Xj from the 
subspace spanned by Xs in the Hilbert space of second-order random variables. As depicted in 
Figure [T] let Tj and Yj be the rejection vectors of Xi and Xj from the subspace [ATs] spanned by 
Xs, respectively. Then the minimum distance between Tj and Yj is related to the conditional mutual 
information I{Xi-,Xj\Xs). We establish this relationship in the next Lemma. 

Lemma 1. Let {Xi, Xj, Xs} be a subset of a zero-mean multivariate Gaussian vector X = 
{Xi, ...,XnY. Then 

2I{Xp,XfXs) = logE[F/] - logminE[(ri - aY^f], (5) 

a 

where Yi = Xi — {P'^Xs with /?' = argmin^E[(Xi — P'^Xs)'^], andYj = Xj — {P"Y'Xs with 
P" = argmin^E[(Xj - P'^ Xsf]. 
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Figure 1: Geometric representation of/(Xi;Xj|Xs). 


Proof. Proof See Appendix l7.2l □ 

Corollary 1. Let X be an n-dimensional zero-mean Gaussian vector with corresponding GMRF, 
G = (y, E). Then, for every node i £V, Yi, the rejection vector of Xi from is orthogonal to 

Xj for every t ^ {i} U Mi. 

Proof. From the definition of GMRF, we have / (Xi; Xj\Xj<j-fj =0 for every t ^ {z} U Mi. Lemma 
[Uimplies = 0. Moreover, = 0. This follows from the definition of Xj and because 

Yi is orthogonal to Hence, ¥,[YiXj] = E[Til^-] + E[liX'] = 0. □ 

Next result gives a test to identify non-neighbors of a node i from a subset S that does not contain i 
but contains all neighbors of i. This means Mi C S. 

Theorem 1. Let S C {1, ...,n} contains all the neighbors of a node i, but not i. Then, the zero 
entries correspond to the non-neighbors of node i, where Ds is a diagonal matrix 

whose entries are the main diagonal ofYs s- 


Proof. See Appendix l7.3l □ 

3 Greedy neighborhood selection via mutual information test 

In this section, we present our first learning algorithm, which outputs Si, the estimated neighbor¬ 
hood of a given node i. Initially, Si is an empty set. At round k, the algorithm finds a node that 
maximizes the conditional mutual information with i conditioned on S^ , where the superscript 
is used to denote the number of rounds. Namely, ji = argmax^.^^gCfc-i)^^ I{Xi\Xj\X ^(k-i)). If the 
corresponding conditional mutual information is below a given threshold ep, the algorithm stops and 
outputs the current estimate. Otherwise, it prunes the unlikely neighbors in set Si using Theorem[T] 

More precisely, it computes the vector u* = gC^-i) removes those 

nodes from the active set whose corresponding values in this vector are smaller than a calibrated 
threshold ep- Algorithm [T] summarizes these steps. 

Proposition 1. If node i has only one neighbor, Algorithm\I\always returns the correct neighbor 
after one round. 


Proof. See Appendix l7.4l □ 

When the neighborhood of a node contains more than one node, Algorithm[T]does not necessary find 
a neighbor at each iteration. In spite of that, we can prove the structural consistency of the algorithm 
for sparse GMRFs. 

3.1 Structural Consistency for Sparse Gaussian Models 

In this section, we prove the sttuctural consistency of Algorithm [T] for a class of Gaussian models 
that satisfy the so-called restricted eigenvalue property. 
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Algorithm 1 Finding Neighbors of Node i 


1 : Input : S, i. Threshold ep, 0 < v < 1 
2: Output : Si 
3: ^ 0, and fc ^ 1. 

4: while tine do 

5: Next candidate; ji -s- argmax _ /(Xi; X, |A„(ic-i)) 

6 : Updating the active set: -f— U {ji} 


7 

8 
9 

10 

11 

12 


^ -2 

Calibration factor for pruning threshold; -f— E^e 
fc •«— A: + 1 

if (5 — /(Xi;Xji |Xg(fc_i)) < ep then 




break 

end if 


Xg^-l))) 


13: es -f— — e“2<5)fcj 

14: L •<- {i : |m*| < es} 

15: Pruning step: -f— 

16: end while 


Assumption 1. Let —{i} := {1,..., n} \ {i}. We assume that 3 Cmin > 0 <^nd p > 1 such that the 
partial covariance matrix := E[X_{i}X^^^j] satisfies 

CminW^Wp < ||S_{i}A||ir < pCminW^WPi 

where A is an arbitrary sparse vector with at most pd non-zero entries, and p > 2 + 

^pHViP^-p)/d + V2)\ 


As it is discussed in Qol, restricted eigenvalue assumption imposes a more relaxed condition on 
the model parameters compared to the condition imposed by the -regularized Gaussian MLE lfT4ll 
or the condition imposed by the linear neighborhood selection with -regularization CD. Fur¬ 
thermore, under the restricted eigenvalue assumption, Johnson et al. ifTOll show the sparsity of the 
forward-backward greedy algorithm that optimizes the following loss function: 

( 6 ) 


This algorithm picks j* = argminjg(suppp)=,o! C(/3 -f aCj) as its best next candidate and removes 
j = arg minjgsiipp /3 >C(/3 — fijCj) as its least likely neighbor, where ej is a unite vector with only one 
non-zero entry, located in the j-th position. 

In order to show the structural consistency of our algorithm under Assumption [T] we present the 
next lemmas that guarantee if the aforementioned forward-backward greedy algorithm with the loss 
function in (| 6 ll returns the neighborhood of a node i, so does Algorithm[T] 

Remark 1. Unlike the loss function in conditional mutual information criterion selects the next 
candidate only based on its projection proportion which is geometrically the only proportion that 
matters for neighborhood selection. The reason is as follows: let ji and j 2 be the nodes that are 
chosen in the 5-th line of Algorithm\I\and using the loss function in given the active set s\^ 
respectively. From Lemma\I\ we have 


ji 


argmax 


E^[YiYf 

E[i; 2 ] 


72 


argmax 


E^[YiYf 

E[(X') 2 ]+E[ 1 ^ 2 ]> 


where Yj and Xj are the rejection and projection components of the orthogonal projection of Xj 

onto the subspace spanned by respectively. From Corollary [7] we know that the subset 

contains Mi if and only ifYi is orthogonal to Yj for every t ^ {i} U This implies 

that only the rejection components of Xj and Xi after the orthogonal projection onto the subspace 

spanned by X^^k-i) are relevant quantities to check whether s[^ contains the neighborhood of 
i. Hence, geometrically, ji is a better candidate to be the next neighbor of node i. 
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Lemma 2. Let ep = ^log for some 0 < e < 1, then if Algorithm\I\ adds node j to the 
active set S of node i, it decreases the loss function given in dS]) by at least kijC, where ki j = 
E,, exp(- 2 /(X,; Xs) - 2I{X,-Xs)) >0. 


Proof. Proof See Appendix l7.5l □ 

Lemma 3. If the forward greedy algorithm guarantees no false exclusions, then the pruning step 
excludes all non-neighbors. Moreover, the most unlikely node removed by the forward-backward 
greedy algorithm always belongs to the set L identified in Algorithm\I\s pruning step. 

Proof. Proof See Appendix 17.61 □ 

Under Assumption!!] Lemmas 1 and 3 in (jS) and Lemma|2| guarantee no false exclusions using the 
mutual information test as long as a proper forward stopping threshold ep is selected. Lemma |3l 
guarantees no false inclusions in the backward part of the mutual information test. Hence, we will 
have the following result: 

Theorem 2. In Algorithm\T\ let ki := kij, Ki := maxj,^i kij, and ep := ^ log such 

that 1 > e > min{l — e,8cprid\ogn/{CminNki)}, where d is the maximum node degree in the 
graphical model, n is the number of nodes, c and 0 < e <C 1 are constants. Under Assumption 
[7] if the nonzero entries of vector | are lower bounded by \J ?i2peKi j Crain, ond the 

number of samples N > Cdlogn for some constant C, there exist constants ci and C 2 such that 
with probability at least 1 — Ci exp(—C 2 A^), Algorithm\I\will terminate infinite number of steps and 
return the exact neighborhood of the given node i. 

Proof. If I {Xp, Xj\Xs) > ep. Lemmas [T]and|2]imply that after adding t to the active set S, the loss 
function £ in (| 6 |) decreases by at least eki. On the other hand, from Lemmas [T] and [3 we know that 
nodes that are removed by the pruning step in Algorithm [T] will increase the loss function £ by at 

most u[\ — Note that ^ is precisely the amount of decrease 

in £ as a result of adding j to the active set S. Thus at each round, the loss function £ reduces by at 
least (1 — v)eki and hence, Algorithm[T]terminates within a hnite number of steps. 

Using Lemma 9 in ifThl . Theorem 2 in ifTOl . and the fact that at each round in Algorithm [T] £ 
decreases by at least eki > Scprjdlogn/CminN, we obtain that Algorithm [T] given N samples will 
return the exact neighborhood of i with probability at least 1 — ci exp(—C 2 A). □ 

4 Neighborhood selection via thresholding 

In this section, we present Algorithm |2| which is very similar to the one that was first introduced 
in ||2l on Ising models. Unlike the hrst algorithm we presented, which selects new neighbors and 
prunes potential false neighbors at the same time, this algorithm selects all potential neighbors hrst 
and then prunes false neighbors. For this algorithm, we assume that Assumption |2] holds and the 
parameters involved in the inputs of the algorithm, are known. 

For a walk summable Gaussian graphical model, the algorithm works with high probability. If we 
compute the empirical covariance between i and j given set Si as 

= ^ij ~ , (7) 

where we assume that the number of samples is large enough so that E^ ^g exists, then the maximum 

of Eij|s^ can be lower bounded by a constant, where Si is the estimated neighborhood of node i and 
the maximium is taken over j, the undiscovered neighbors of i. Hence, in each round, we can simply 
select the neighborhood set by thresholding the absolute value of the conditional covariance, which 
guarantees to hnd at least one neighbor with high probability. The size of the pseudo neighborhood 
can be upper bounded, and thus pruning can be performed efficiently. 

There are numerous ways of pruning the neighborhood Si (for example, Qol). Here we use an 
efficient yet simple one, given in Algorithm[3| It simply computes F = jS^ g.Eg^g |, and prunes 


7 





all the nodes with the corresponding entries that’s below a threshold va, where a is the same as in 
Assumption 121 and v G [0,1). We prove that this algorithm prunes the neighborhood efficiently, 
while preserving the neighbors with high probability. 

4.1 The thresholding and pruning algorithms 

Assumption 2. Consider a Gaussian graphical model satisfying the following set of assumptions. 

• The model is a walk summable. 

• The diagonal elements of J is bounded by dmin <^nd dmax- 

• The absolute values of the off-diagonal non-zero elements are lower bounded by a, and 
upper bounded by b. 

• The degree of the graph is upper bounded by a known number A. 

Remark 2. Notice that the second part of the third bullet is not required in order for the algorithms 
to function correctly. It is only used in deriving theoretical guarantees in a simpler form. In fact, 
given the first two assumptions, the absolute values of off-diagonal non-zero entries are naturally 
upper bounded by dmin because J is positive definite. 

Under Assumption|2] we present Algorithms|2][3] in which e and v are set manually. The correctness 
of these algorithms under the assumption is proven in the next section. 


Algorithm 2 Greedy Neighborhood Selection for Node i via Thresholding 

1: Input. Nj cv, djYiiji, djjiax^ 

2 : Output: Si. 

3: Initialization: G- 0, and k ^ 0. 

4: TG- ad-lffdl,^^{l + a) - - e 

5: while A:<Ado 

6 : For all j G ^ ^ 

7: +1) ^ ) u {j G n({*} U Sf^) : | > r}’ 

8 : k ^ k-\-l 

9 : if then 

10 : break 

11: end if 

12: end while 

13: 5, ^ ^ 


Algorithm 3 Pruning the estimated neighborhood Si 

1. Input. Si, E, Oi, timm; d,Yiax^ 

2 : Output: Sf. 

3: rff va 

4: f ^ |Ei_S.Sg.^gJ 

5: L ^ Find(f < r^) 

6 : ^ S,\L 


4.2 Algorithm Efficiency 

We characterize the efficiency of Algorithm |2] with two upper bounds, which help derive the com¬ 
putational and sample complexity of our algorithms later on: ( 1 ) the upper bound for the size of 
Si, and (2) the upper bound for the number of iterations. These two aspects are good proxies for 
characterizing the algorithmic efficiency, since a good algorithm should always be able to select all 
the actual neighbors with a small number of iterations, while keeping the number of non-neighbors 
in Si at minimum. 

We first upper bound Si selected by Algorithm|2]with the following result. 


8 











Theorem 3. Suppose that t is the threshold used in Algorithm^ Assumption^holds, and assume 
that the absolute values of off-diagonal non-zero entries of J are upper bounded by b. Then the 
number of nodes selected into Si is upper bounded by 


l^.l< 


(1 - 


(8) 


where is the actual degree of node i. 


Proof See Appendix 17.71 


□ 


The above theorem bounds 15^1 by a constant times A^. However, it is worth pointing out that if 
the size of the graph is smaller than the upper bound given in Theorem [3 a tighter upper bound is 
needed. 


Meanwhile, the number of iterations of the algorithm, which is at most A, can also be upper 
bounded. 

Proposition 2 (Upper bound for A). For a Gaussian graphical model satisfying Assumption]^ we 
must have 


A < 


'max^ 

a 


2 


(9) 


Proof. See Appendix l7.8l □ 

Remark 3. The upper bound for Ai is independent of the graph size, mainly due to the assumption 
of a walk summability. By plugging in Proposition^to]^ we see that |S'i| is upper bounded by a 
constant. 

With these results, we can now characterize the computational complexity of Algorithms |2] and [3 
for each node. 

For each node, the selected neighborhood Si always at least as large as A^, but is upper bounded 
by a constant that’s independent of the graph size at the same time. This implies a computational 
complexity of 0{n) for a set of fixed parameters, since Algorithm |3 iterates at most A rounds, 
which is upper bounded by a constant, and in each round, at most n conditional covariances are 
computed. For the pruning algorithm, the computation of each S ^k) involves inverting E ^k) „(k ), 

requiring at most 0(15^1^) computational complexity. Since | Si \ can be upper bounded by a constant 
independent of n, the computational complexity of the pruning algorithm given a, dmin, dmax, a, 
is essentially 0(1). 


4.3 Correctness 


We now show that, with the threshold chosen as in Algorithm |2l Si contains the neighborhood of 
node i, J\fi, with high probability This conclusion is based on a lower bound for the absolute value 
of maximum conditional covariance between a node and its undiscovered neighbors at any point, 
conditioned on the estimated neighborhood Si at that point. This property for Gaussian graphical 
models that meet Assumption|2]is introduced as follows. 

Lemma 4. Under Assumption^ denote the estimated neighborhood of node i at any point by Si, 
and assume there are K neighbors of node i undiscovered. Then 


max 


> 


1 __ 

Kdl[du{l + a)dmax-\\J^,^^^\l?' 


( 10 ) 


where Mi\Si = {j S U :j GAfi,j Si} and da is the diagonal element of J matrix corresponding 
to node i. 


Proof is in Apr)endix l7.9l 

With the above lemma, we can design the threshold according to the following Theorem. 
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Theorem 4. Assume that a Gaussian graphical model satisfies Assumption^ Then, for any node i 
and any estimated neighborhood Si, 


max IE 


y|Sil > 


jdMASi' + a) - a?)' 


( 11 ) 


Proof. The proof follows directly from LemmalU observing that |1 A/iVS, Hi ^ Kaf, and K > 
1 . ' ' ' U 


If we further restrict the graph to be free from triangles, a tighter bound can be obtained as follows. 

Corollary 2. Assume that a Gaussian graphical model satisfies Assumption \2\ and that the graph 
does not contain triangles. Then, for any node i and any estimated neighborhood St, 


max 


IE, 


ij\Si I ^ 


> 


M'Lax -a^)' 


( 12 ) 


Proof. See Appendix IT.lOl 


□ 


For the normalized case, the right hand side further simplifies to a(l — which is intuitively 

correct since a larger value of a makes learning the neighborhood easier as it “separates” the non¬ 
zero entries from the zero entries. Also note that the conditional covariance involved in the proof are 
exact, which indicates that if we have infinite samples, then Algorithm|2]will return at least 1 neigh¬ 
bor per round. When we have only finitely many samples, the performance of the algorithm depends 
on how well the empirical conditional covariance concentrates around the conditional covariance, 
which is the topic of the next subsection. 

The intuition for walk summable Gaussian graphical models directly provide the proof for the cor¬ 
rectness of Algorithm^ 

Theorem 5. Assume a Gaussian graphical model satisfies Assumption\2l and for node i, Afi C Si. 
LetT = T,i^Si'S‘g^ g., and assume that the j-th element of Si corresponds to node Si{j) in the graph. 
Then, T^ = 0 if sfij) ^ M, andTj = -Ji^siU) if Si{j) S M- 

Proof. The proof follows directly by noticing that when Afi C Si, we have Ei g^ = —Ji s^Es. g^. 

’ □ 


This result shows that we can set the pruning threshold G (0, a), and it will prune all the non¬ 
neighbors and preserve all the neighbors asymptotically. The concentration results involved in work¬ 
ing with finite number of samples are differed to the next subsection. 

4.4 Sample Complexity 

The structural consistency of our algorithm, along with sample complexity, is stated in the following 
result. 

Theorem 6 (Structural consistency). For a Gaussian graphical model satisfying Assumption^ de¬ 
note the thresholds selected by Algorithms^and^as r and t'p, respectively. Then, given N i.i.d. 
samples, there exists universal constants C 3 , 6 * 4 , C 3 such that when N scales as 

W > log n, (13) 

with probability at least l — Cn exp(—CaW), the combination of Algorithm^ffollowed by Algorithm 
^returns the actual neighborhood of node i. 


Proof. See Appendix IT.llI □ 

This indicates that if the number of samples scales as Vt{a~^ logn), then the structural consistency 
is guaranteed. 
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4.5 Scalability 


We finally show that Assumption |2] required for Algorithm |2] does not affect the scalability of the 
graph. This is a question which arises naturally from Assumption |2] since we required a walk 
summability and lower bounded the absolute values of off-diagonal non-zero entries. It also arises 
from the the result of Theorem[3 namely, since 15'^ | is upper bounded by a constant times , where 
Ai is also upper bounded by a constant, is it possible that under Assumption the graph size is 
restricted to be small? 

To answer this question, we present the following result, which implies that there exists graphs of 
arbitrary size under a fixed set of parameters in Assumption]!] The results are given in Propositions 
|3] in which we assume that A is tight, i.e., it is equal to the largest node degree in the graph. 

Proposition 3 (Sufficient condition for scaling the graph). Given the parameters in Assumption \2\ 
there exists a walk summable Gaussian graphical models of arbitrary size if 

1 < A < (14) 

0 

Proof. See Ar)r)endix l7.12l □ 

5 Simulation 

In this section, we illustrate the performance of our algorithms and the conclusions drawn previously 
with the help of numerical methods. 

5.1 The forward-backward mutual information test 

We simulated the performance of the mutual information test, forward backward greedy algorithm 
in Jlol, and the Lasso method on the following graphs types; chain, star, grid, diamond, and ran¬ 
domly generated . The threshold of the greedy algorithm is set to be relatively large, since small 
thresholds permit the forward greedy algorithm to select a much larger neighborhood than the actual 
one, in which case the computational complexity of the pruning would increase. The threshold is set 
to be fixed, or varying such that e oc log(n) /N decreases as a function of N, as in ifTOl . so that when 
enough samples are observed, \J‘i2peKr/Cmin is lower than the smallest entry of jE^. 
and the structural consistency is guaranteed by Theorem]!] We generated the entries of the inverse 
covariance matrix randomly. The simulation for Lasso uses the code in Q. To encourage sparsity of 
the results obtained by Lasso, we use the largest regularizer possible such that the mean square error 
is within 1 standard error of the minimum mean square error, determined by the standard fc-fold 
cross validation. For all the experiments. Lasso took up the majority of time, while the greedy algo¬ 
rithms were fast. Thus, we compare the results of Lasso for graphs of relatively small size, and for 
larger sized graphs, we compare the performance of Algorithm JT] and the greedy forward backward 
algorithm of iHQi. 

We use two metrics to compare the the algorithms: (1) success rate, defined as the portion of nodes 
in the graph whose neighborhood is correctly estimated, averaged over 100 trials, and ( 2 ) the accu¬ 
racy of the test measured byl — |4lAA|/|A|, where A is the true support of the inverse covariance 
matrix and A is the estimated support. Note that A A B := {{i,j) : Aij f Bij} when A and B 
are the adjacency matrices of two graphs. 

We first compare the performances of the mutual information test, forward-backward greedy algo¬ 
rithm, and Lasso on the chain graph (n = 10,d = 2), the star graph (n = 10, d = 2), the grid graph 
(n = 9, d = 4), and the diamond graph (n = 4, d = 3). We chose Cmin to be 0.1, and p ranged from 
3 to 10. The threshold was set as e = c\og{n)/N, where c is the tuning parameter. In backward 
pruning i/ = 0.5. For these special graphs, we can see in Figures]!]and]3]that the mutual information 
test behaves as good as the forward-backward greedy algorithm. Greedy approaches have similar or 
better performance than Lasso with much lower computational complexity for small graphs. Much 
better performance was observed by QO) when the threshold decreased as a function of the sample 
size for larger sized graphs, in both computational complexity and sample complexity. 

We next compared the performance on random graphs of size 10 (d = 6) and 20 (d = 13), with aver¬ 
age number of edges for each instance around 20 and 51, respectively, p is set to at least 10 to allow 
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Accumc,;SuccB!»_BatB _ _ _ ^ ^ Accuracy / Success Rate 



(a) Chain of length 10. 



(b) Star of size 10. 


Figure 2: Performance comparison between greedy algorithms and Lasso with decreasing threshold 




(a) Diamond Graph. (b) 3 x 3 grid. 

Figure 3; Performance comparison between the greedy algorithms and Lasso with fixed threshold 


easier generation of inverse covariance matrix satisfying the restricted eigenvalue constraints. The 
results are shown in|4] It can be seen that Algorithm [T] is slightly superior to the forward-backward 
greedy algorithm, when the graph becomes denser. This is mainly due to the fact that the forward 
step of our algorithms uses the conditional mutual information test, which has a higher chance of 
selecting the correct neighbors. 




(a) n = 10, \E\ ^ 20. 


(b) n = 20, |L;| Ri 51. 


Figure 4; Performance comparison between Algorithm [T] and forward-backward greedy algorithm 
on random graphs with decreasing threshold. 
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(a) Upper bound of | 5 'i|/Ai as a function of a for (b) An example comparing the threshold adopted by 
different values of 6/a. the oracle and Algorithmic It can be seen that the 

looseness of the algorithm allows false neighbors to 
be selected. 

Figure 5; Upper bound for |5'i j/A^ and an illustration of the threshold used by Algorithmic and the 
lower bound provided by the corresponding lemma. 


5.2 The thresholding algorithm 
5.2.1 Algorithm efficiency 

We first demonstrate the result of Theorem|C assuming that we have infinite samples (and hence the 
exact covariance matrix E), and that drain = dmax = 1 so that we only have freedom in choosing 
a, a and h. When the threshold is selected as in Algorithmic we have 


\Si\ ^ (1 + g)^ ^ 

Ai “ (1 — a)2 ’ 


(15) 


assuming that A, > 0. This implies that the graphical model is easy to learn when (1) a is small, 
and (2) when b/a is not too large. We hence plot |5'i|/Ai as a function of a for different b/a ratios. 
As can be seen from Figure |5al the graph is easier to learn when b/a is small and when a is small. 
In addition, decreasing value of a does not increase the hardness of learning the graph as long as 
b/a is fixed. Finally, we point out that even when the upper bound is large, it can be seen from the 
numerical results shown in later sections that the actual size of jiSil is small. 


5.2.2 Normalized case: a random instance 

For simplicity and space limit, we only demonstrate the normalized case without triangles, the case 
where the diagonal entries of J are all ones. This frees us from setting different dmin and dmax, and 
the results are easy to track, although we also note that the result for generalized case will be slightly 
worse than the normalized case as well. We randomly generate graphical structures, rejecting those 
instances that contain triangles, and then generate the off-diagonal entries of the upper triangle 
matrix using i.i.d. Gaussian distribution with a = 0.01. We scale the off-diagonal entries to make 
sure that the spectral radius of |i?| is below a certain level of a, and reject those that violate the entry 
wise lower bound. 

Since the actual lower bound adopted is quite loose, which we shall see later, it is likely that Algo¬ 
rithmic selects a superset of the actual neighborhood with very high probability. Hence, the follow¬ 
ing algorithm can be implemented right after Algorithmic Algorithmic simply checks whether the 
adjacency matrix obtained is symmetric. It can be easily seen that if Algorithm 1 succeeds, then i 
and j must be simultaneously in each other’s estimated neighborhood if they are actual neighbors. 
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(b) Oracle Version of Algorithmic accompanied by 
Algorithmic 




(c) Algorithms IC and 1C with 1E6 samples (d) Algorithms ICH] and|C with 1E6 samples 

Figure 6: One random instance with 20 nodes. The true graph contains 22 edges. The oracle 
version and the Algorithmic accompanied by Algorithm|Cfor pruning returns the correct graph. The 
Algorithm IC alone returns a graph containing all 22 true edges and 19 false edges. The pruning 
algorithmICis applied automatically. We set a = 0.4, a = 0.01, h = 0.28 and A = 10. For triangle 
free graphs, we substitute the threshold with|2] 


Algorithm 4 Pruning by Symmetry 

1 : Input: Si, S 2 , ■■■, Sn- 
2: Output: S'l, S 2 , ■■■, S'^. 

3: Initialization: A ^ 2 ;eros(n, n). 

4: Set A{i, Si) 1 for all i = 1 ,..., n 

5: For all {i,j), if A{i,j) ^ A{j, i), set A{i,j) ^ 0, and A{j, i) ^ 0. 
6 : Set S'' ^ find(A(i,:) ^ 0) for all i = 1,..., n 


We first demonstrate the effectiveness of the threshold algorithm accompanied by rough pruning 
with Algorithm 0] without applying Algorithm |C for finer pruning. A random sample graph is 
generated, and the results of the algorithms running on graphs of size 20 are shown in Figure |6] We 
see that the combination of Algorithms |2] and |C manages to find all the true neighbors, while not 
selecting too many false ones. Further pruning by Algorithm |C gives the correct graph. It can be 
seen that the actual size of |Si| is much lower than the upper bound. Meanwhile, to compare with 
the actual algorithm, we also provide an oracle version of Algorithm|Cby substituting the threshold, 
which is designed according to Corollary |2] for triangle free graphs, by the lower bound provided in 
Appendix IT.lOl that corresponds to LemmalC and providing all the necessary information including 
K and \\Ji^^f^\Si Hi’ ^^d S to the algorithm. It can be seen that the lower bound provided in|4](here 
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it’s the corresponding version for triangle free graphs), is quite tight. The difference between the 
actual bound adopted by Algorithm |2] and its oracle version for the triangle free graphs, is depicted 
in Figure l5bl 

5.2.3 Normalized case: probability of success 

We next plot the probability of success for our algorithm, combining Algorithms 1 to 3, as a func¬ 
tion of the number of samples required. The simulation is carried out on 20 node random graphs 
averaging over 100 instances. 

We compare the result of our algorithm to the forward-backward greedy algorithm of Qni, which 
has been shown to outperform the method of Neighborhood Lasso. The main idea of the forward- 
backward greedy algorithm is to first repeatedly select the node into Si that minimizes a loss 
function, which is the distance of Xi to the linear vector space spanned by Xj with j G Si in 
P), until the change in the loss function is below a certain threshold e^. After the forward 
part of the algorithm terminates, the backward part of the algorithm prunes the node that causes 
least amount of change in the loss function until the change is greater than vcs, where zz G (0,1) is 
chosen arbitrarily. 

For the forward-backward greedy algorithm, we set (following their notations) Cmin = 1/(1 + «)> 
d be the actual degree upper bound of each generated instance, p=(l-|-Q!)/(l — a). The forward 
stopping threshold is set to Cs = ScprjdXogim)/{kCmin), where k is the number of samples, and 
T] = \2 + — p/d -f The tuning parameter c, which is used to determine the value 

of Es, is set from 10“^ to 10“"^. Notice that c is a tuning parameter. Even though for different c 
the algorithm will always be structurally consistent, it has to be manually tuned for different set 
of parameters (m, a, a for entry wise lower bound, and d for degree upper bound) in order for 
the algorithm to converge fast. This can be observed from Figure |7^ where different values of c 
yields different rates of convergence. In fact, many algorithms require the knowledge of tuning 
parameters, a study on such problem can be found in IfTTl . Finally, we point out that the forward- 
backward greedy algorithm is more powerful when used to select the graphical structure as a whole, 
compared to neighborhood selection, and that the forward-backward greedy algorithm is likely to 
be more efficient when the number of samples is small. More details can be found in IfTOl . 

We analyze the results are shown in Figure |7^ We also calculated how many edges in total both 
algorithms selects, shown in Figure |7b] where each algorithm is accompanied by the raw pruning 
done by Algorithm|4] From the figure, we can see that (1) the performance of the forward-backward 
greedy algorithm is very sensitive to the tuning parameter c, while our algorithm does not involve 
any notion of tuning parameter; (2) the performance of our algorithm matches the best performance 
for the forward-backward greedy algorithm with the five choices of c; (3) Both our algorithm, and 
the forward-backward greedy algorithm with the best choice of c, are very efficient in the selecting 
the pseudo neighborhood (although forward-backward greedy algorithm prunes the pseudo neigh¬ 
borhood every time it picks a new node) when the number of samples is large. The reason that our 
algorithm will select a large pseudo neighborhood when the number of samples is small is due to 
the choice of the pruning threshold in Algorithm^ which we arbitrarily set to 10“^. 


6 Conclusion and future work 


In this paper, we studied the problem of neighborhood selection for walk summable Gaussian graph¬ 
ical models. We presented a novel property for those type of models, which lower bounds the max¬ 
imal absolute value of the conditional covariance between a node and its undiscovered neighbors. 
Based on this property, we presented two algorithms which greedily selects the neighborhood by 
thresholding the conditional covariance, and prunes the potentially false neighbors, respectively. 
When the graph does not contain any triangles, the bounds can be tightened. We characterized the 
efficiency of the algorithm in terms of the upper bound for the ratio between the selected pseudo 
neighborhood by thresholding and the actual degree of the algorithm. We also showed that the 
number of iterations executed by the algorithm is also upper bounded. We gave computational com¬ 
plexity results for the algorithms we propose, and we presented results on sample complexity and 
structural consistency of the algorithms when working with finite number of samples. We simulated 
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(a) Probability of Success. Both the forward selection 
parts and backward pruning parts of the proposed al¬ 
gorithm and the forward-backward greedy algorithms 
are applied (as opposed to the other part of this fig¬ 
ure). 



(b) Pseudo neighborhood size selected by the forward 
part of the algorithm, pruned only by symmetry. 


Figure 7; Performance evaluation on 20 node random graphs. When c decreases from to 

the forward stopping threshold for the forward-backward greedy algorithm is large enough 
so that the algorithm will not select a very large pseudo neighborhood. When c = the 

threshold becomes too small, and the pseudo neighborhood size becomes very large. 


our algorithm for the triangle free version, and the numerical results showed that the algorithms 
work well in reality. 

Future work includes the following aspects; 

• Can |S'i| be more efficiently upper bounded? 

• Can we develop a similar technique for more generalized type of graphs, mainly Gaussian 
graphical models that are not walk summable? 

The answer for the second question is more fundamental, and we believe it requires a clever way of 
connecting and Si,7v;, and a clever way of exploiting the zero patterns of the J matrix. 
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7 Appendix 

7.1 Walk Summable GMRF 

We start off by introducing several concepts that lead to the concept of walk-summability. 

For any information matrix J that is symmetric and positive definite, we denote Jnorm to be its 

normalized version, which can be obtained by letting 



(16) 


An information matrix J itself is said to be normalized if J = Jnorm, i e-. the diagonal elements of 
J are all ones. Otherwise, we can represent J by 



(17) 


where D is the diagonal matrix of J, with Du = Ju and Dij = 0 for i f j. 

Next, we denote R as the partial correlation matrix, which satisfies R = I — Jnorm- The entry Rij 
measures We denote |i?| as the matrix satisfying \R\ij = \Rij\. 

With these two concepts, we proceed to the definition of walk summability, which is parameterized 
by a. 

Definition 1 (a walk summability |[T|). A GMRF is said to be a walk summable if \\\R \\\2 < a < 1. 

The most elegant property of the walk summability is that it allows us to relate the covariance 
between Xi and Xj as the weight sum of all the paths between nodes i and j in the underlying graph 
specified by J, where the edge {i,j) weighs Jij and the weight of a path is the product of all the 
edge weights along the path. We illustrate this in the following example, which comes in useful 
later. 

Example 1 (Normalized J matrix). Consider the special case where J is itself normalized, i.e., 
J = Jnorm- In this case. 


OO 



(18) 
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This shows that T,ij is the sum of for all non-negative integers k. Since R preserves the graphical 
structure of J between different nodes, it’s immediately seen that R^j represents the summation of 
weight of all the paths from i to j with k hops. 

Another interesting property of the walk sumtnable Gaussian graphical model is that it has restricted 
eigenvalues, which often appears as the condition for exact recovery of the neighborhood (with high 
probability) using Lasso and greedy methods ifTOll . Here we present it as a lemma. 

Lemma 5. Assume a Gaussian graphical model with n vertexes is a walk summable, and Ju S 
[t^mm J ,Vr G then Vx € , 

(1 + <||5]x||2 < (1 - (19) 

(l Ol^dmiri Iklb <\\Jx \\2 < (1 + a)d,nax\\x\\ 2 - (20) 

Proof Denote the D as the diagonal matrix of J. Then J = y/13{I — R)'/D. Since || |i ?|||2 < a, 
we know that —a < Xr < a. Hence, (1 — a)dmin < Aj < (1 + a)dmax- Since J = 

(1 + ar^df,\^ < As < (1 - (21) 

□ 


This indicates that a walk summability is a stronger condition than the restricted eigenvalue con¬ 
dition, in the sense that it applies to all vectors x, while the restricted eigenvalue condition only 
requires sparse vectors, see for example Qo). Hence, the well explored methods such as Lasso and 
forward-backward greedy algorithms can be readily applied to walk summable Gaussian graphical 
models. The reverse direction of this relationship, however, may not be true, since we cannot imply 
II |i?|||2 < 1 from —a < Xr < a, while in order for the model to be walk summable, we must have 

|||i?|||2<llll. 

In addition, notice that the value of (1 -f a)draax/\(f — ct)dmin] characterizes the strength of the 
constraint on the eigenvalues of J and E. With the same value of (1 — a)dmin, if we decrease one 
of a and dmax/dmin while holding the other, then the eigenvalues of E are restricted to a smaller 
region. 

Finally, for convenience of further reasoning, we point out the following fact. 

Corollary 3. For an a walk summable Gaussian graphical model, let S <Z V. Denote = 

Jss\S’ = Jss, = Egg|g, = Ess. Then Vx G R", and for i = 1,2, we have 

(1 - a)drmn\\x \\2 < ||Mi*)x||2 < (1 + a)d„iax\\x\\ 2 , (22) 

(1 -f a)"^d-i^||x||2 < ||A^^*^a:||2 < (1 - a)-^df^^J\x\\2■ (23) 

Proof. When i = 2, the proof follows directly from Lemma|5] by applying the interlacing property 
of the eigenvalues for a submatrix. When i = 1, the proof follows similarly by first using the 
relationships (|2]i and (O, and then apply the interlacing property of the eigenvalue and Lemma 
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7.2 Proof of Lemma [T1 

Denote the projection of Xi and Xj onto Xr by X' and Xj, respectively. Then, X' = (/3')^Xs, 
where /3' = argminsE[(Xi — ff^XsY] and it can be shown that X' = Ei^sEss^fs- Similarly, 
Xj = Ej-sEgjgXs. Hence, 


minE[(r, - aYjf] = E^^] - 


m?] 


— Eii — Es^jEg gEs,i — 


(Ejj — X^s,i^s,s'^s,j) 


= Eij|s — 


y2 

^ij\S 
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where the last equality uses the expression for conditional covariance between Xi and Xj condi¬ 
tioned on Xs- The proof is then completed by noting that 


2I{X,-Xj\Xs) 


log 






log 


__ 

minQE[(yi - aYj)^]' 


7.3 Proof of Theorem [TJ 

As we discussed, the support of J = identifies the corresponding MRF of a jointly Gaussian 
system. Hence, a node j dose not belong to A/i if and only if = 0. Consequently, the {i,j) 
minor of E, denoted by , is zero for every t ^ AfiU {*}. Mij is the determinant of the matrix 
that results from deleting row i and column j of E. This implies that for every j ^ A/i U {i}, after 
removing row i from E, the i-th column of the resulting matrix (denote it by S') can be written as 
a linear combination of columns {1,n} \ {*, j}. Using this observation and the fact that E is 
positive semidefinite, we obtain 

E'=EV,u, (24) 

where |A/i| is the number of i’s neighbors. S' and denote the *-th column of E' and the sub¬ 
matrix of E' comprising rows with the index set A/i, respectively. Therefore, if 5” = {A/i, 5” \ A/i} 
and i ^ S, Equation (l24l i implies 




Note that Eg 5 is non-singular, because it is a principal minors of a positive-semidefinite matrix E. 

7.4 Proof of Proposition [TJ 

Denote the node of interest by Xi, and its neighbor by Xj. We show that I{Xi;Xj) = 
maxg I{Xi; Xg). This is because, if we consider an arbitrary non-neighboring node s, then 


I{X,; Xg, X,,Xr) = I{X,; Xj) + /(X,; Xg,XR\Xj) = I{Xf, Xg) + I{X,; X,,XR\Xg), 

where R = {l,..,n} \ {i,j,s}. By the definition of RMF, I{Xi] Xg, XrIXj) = 0, while 
I{X,;X„XR\Xg)>0. 

7.5 Proof of Lemma 121 

LetcF = ^logp^, then I{Xi-Xj\Xg(k-i)) > ep implies ■= where 

the latter is the amount of decrements in the loss function (|6]l after adding t to the active set of node 
i. It is not hard to see that ^ log kij = | log Sr — I{Xi] X^^k-i)) — I{Xj; X^ik-i)). 

7.6 Proof of Lemma |3j 

The first part is a direct consequence of Theorem [T] Next, we show that the j* obtained in the 
backward step of Algorithm 2 in Qo) belongs to the set L defined in the 12th line of Algorithm 

[T] given the active set Recall Yi = Xi — {(5')'^X where P' = argmin^ E[(Aii — 

/3^Ar_g(fc-i))^]. Then, 




J 


argmin E 
tGSf 





argmin E [PjXj {2Yi + ■ 


By projection theorem, we must have E[2fjli] = 0. Hence, j* = argmin -^gfk-i) (/3j)^ Sjj, which 
corresponds to the minimum entry of the vector u* defined in the 11-th line of AlgorithmJTj 


7.7 Proof of Theorem m 

Assume that the algorithm was executed m rounds before terminating. Denote the selected new 
neighbors at round k by , and the entire set of selected neighbors at the end of round k by S)) . 
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(25) 


Let = 0. Then, 

m 

= X! (^Z2|Sf 

k^l 

m 

= E^. 




fc=i 




By CorollaryH the eigenvalues of are inside the region [{l-a)dmin, {^ + ot)dmax\, 

and by the way the threshold is designed, 




(26) 


Notice that the left hand side of (l25l l can be upper bounded by J^u — To prove this, first 

notice that when C we have E..,„(*,+!> < E .as can be seen from the each 

‘ * ll\b^ 

individual term inside the summation of (l25T l. since is positive definite. Secondly, 

(A;) 

when Ni Q , E^ = 0, by local Markov property. Combining these two observations 

shows that Ejjjg^ is always non-increasing when new neighbors are selected, but remains unchanged 
once all neighbors has been selected, which proves the upper bound for the left hand side. 

Hence, 

(1 ^ Ej^ Ejv/^ 

<{l-a)-^d-]yX. (27) 

We thus arrive at the conclusion that 


\S^\< 


(1 - afdl^i^r 


:A,,. 


(28) 


7.8 Proof of Proposition |2] 


Without loss of generality, assume node 1 has A neighbors, from node 2 to A -f 1. Then, consider 
the subgraph involving these A -f 1, whose partial correlation matrix is denoted by R. Denote the 
adjacency matrix of a star graph with A + 1 nodes by A, then, since no triangles exist in the graph, 
we have 

T^PI|2 <|||i?|||<a, (29) 


where the hrst step is because p{A) < p{B) when A < B entry-wise and when A is positive, and 
the second inequality is by assumption. 

Notice that for a star graph of dimension A -|-1, the spectral radius of adjacency matrix can be easily 
computed (by definition of eigenvalue) to be ^fK. Hence, 

dmax^ 
a 




where the equality hold when the subgraph containing any one of the nodes with degree A and all of 
its neighbors is a star graph, with dmin = dmax and all non-zero off-diagonal entries take the same 
value a. 


7.9 Proof of Lemma |4] 

We start off by proving the normalized case and then extend the proof to the generalized case. For 
the normalized case, the diagonal elements of J are ones, as given in Example[T] From (|2|, we know 
that the conditional covariance matrix can be obtained by inverting J55. Notice that J5 5 preserves 
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the structure of the original graph on the subset of nodes S. Hence, we can always treat Jg 5 as the 
J matrix of the subgraph defined on set of nodes S. Hence it’s sufficient to show 


maxE? > 


1 UmWI 


Notice that when J is normalized, we have (fTsT l. which implies 
have 


(31) 

Thus, we 


jeAfi 


We now claim that 


Ji ,Afi ,Mi ,Afi ^i ,Afi 


Ji ,Afi Ji ^Afi 

(1 - 


(32) 


(33) 


where A = {JA/i^Afi — Jjf. j^. ■— ^ {^})- To show this, we exploit 

the fact that = 0 when i and j are non-neighbors. 

Let S = Afi U i. By block matrix inverse. 


Ss,s = (^Js,s - Ts,sc Jgc 5 ^ 


-1 


(34) 


Notice that the first row of Js,S‘ and the first column of Js<=,s are all zeros, we must have 


Jii Ji^Afi 


(35) 


Js,s — Ts.gc Jg,, Jgc 5 = 

where A is the block matrix introduced previously. 

Remember that we wish to find Ejvi.A/'iJ which is the block matrix at the bottom right corner when 
inverting the right hand side of dTSl l. Hence, by Schur’s complement, we have 


^Afi,Afi — A-f AT/Vi^2(l Ji^Afi^JAfi^i) Ji^Afi^ 


= A 


^^Afi ,iJi,Afi ^ 

1 Ji.Afi^^Afi^i 


Hence, 


TV-1 7 A I Ji,Afi^JA\fi,iJi,Afi^ 

'Ji,Afi^Afi,A/i — T/iTVi-'r H - z 

1 - Ji, Afi^JAfi, i 


= 1 + 


Ji,AfiJ^JAfi,i 
1 Ji,AfiJ^JAfi,i 
Ji,AfiJ^ 


Ji,Afi-^ 


Hence the claim holds, and we have 


1 




j^J^i 


{l-J^,AfAJlAfi?' 


(36) 


(37) 


(38) 


Notice that the eigenvalue of A is bounded within [(1 + a)~^, (1 — a)“^], and that the quadratic 
forms in both numerator and denominator share the same eigenvectors. Hence, the quadratic forms 
in both numerator and denominator achieve the minimum at the same time, and by the positive 
definiteness of the right hand side of dTSl) . we have AJjvi.i < 1- Hence, 




II 


2 

2 


((l + a)-||J.xJli)^’ 


(39) 
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The proof is completed by noting that the maximum of a group of real values is lower bounded by 
the average. 


For the generalized case, J = '/DJnormVD, which indicates that 

\/Dm, 

\/ = Jnorm,i,J\fi \/ • 




and 


Hence, 

2 Jnorm,i,J\fi'' 

2 -^ “ 




Jnorm,i,N'i Mi 'i/^MiMi '^norm,iMi 


JiMi^-N'iMi 


dl 


dl {di, - J*.a/;AJ5^J2 


> 


Um,M 


dfiidiii^l Oi^d^ax ll'A.A/illi) 

where A = 


2\2 ’ 


(40) 

(41) 


(42) 


7.10 Proof of Corollary|2] 

We first prove a corresponding version of Lemma|4]when the graph does not contain any triangles. 

Under Assumption|2] denote the estimated neighborhood of node i at any point by Si, and assume 
the graph is free from triangles, and that there are K neighbors of node i undiscovered. Then 


^ K dl{du maXjgjv-.\Si djj - \\JiMi\Si\\l)'^ ' 


> 


1 


\\JiMi\SiM 


(43) 


To prove this, we again consider the normalized case. The first few steps are identical, we hence 
start from equation (jSS). 


Since the graph does not contain any triangles, and J is normalized with diagonal elements being 1, 
JMMi must be an identity matrix. Notice that the following normalized symmetric positive definite 
matrix 


I B 
C ’ 


(44) 


satisfies the property that the largest eigenvalue of BC~^B"’^ is 1, and BC~^B^ is positive definite 
(since / — BC~^B^ has to be positive definite and C~^ is positive definite). Since A“^ is of the 
above form, the minimum eigenvalue of A is 1. Noticing that < 1 (so that the numerator 


without squaring is always positive), and that A^A shares the same eigenvectors with A, we have 

(45) 


y2 ^ WdiMiWl 


The proof is completed by noting that the maximum of a group of real values is lower bounded by 
the average. 


For the generalized case, we have, by similar argument. 


Xi 2 1 

>_ ll^wlli _^ 

“ diidii maxjgjv; d^j - ’ 

where A = 


(46) 


The proof of the corollary is then completed by observing that || ||| > Ka^, and K > 1. 
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7.11 Proof of Theorem |6] 


To perform sample based analysis, we first cite the results in lfT4ll and |[T], which provide concentra¬ 
tion guarantees of the covariance and conditional covariance. For our convenience, we translate the 
notation involved in those results. 

Lemma 6 (Concentration of empirical covariances mm). For any n dimensional Gaussian ran¬ 
dom vector X = {Xi ,..., the empirical covariance obtained from N i.i.d. samples satisfies 






r Ne^ 1 

p 


> £ 

< 4 exp 

3200M2 


(47) 


for all e G (0,40M) and M = max^ Tin. 

Lemma 7 (Concentration of empirical conditional covariance Cl). For a walk summable n- 
dimensional Gaussian graphical model where X = {Xi ,..., Xn), and 

'^ij\S — ^ij ~ (48) 


we have 


max 


> £ 

< 4n’t+2 0xp 

Ne^' 

Cl 

LSCV^,|S|<r/ 






(49) 


where Ci G {Q,oo) is a bounded constant //'||E||oo < oo, and again e C (0,40M), and N > p. 


From these two lemmas, we are able to show that the backward pruning algorithm succeeds with 
high probability as well, in the following lemma. 

Lemma 8 (Pruning correctness). Assume that for node i, the estimated neighborhood Si chosen by 
Algorithm^satisfies Mt C Si- LetT = Ti^Si'Sig^S ' Then 


P 




r nc2\s,\^£^] 

jjf ~ I jjoo > £ 

< 4 exp 

3200M2 


(50) 


where N is the number of samples. C 2 G (0, c») is a bounded constant i/HEHoo < 00. £ G 
(0,40M), andN> |S'i|. 


Proof. See Ar)pendix l7.13l 


□ 


We now prove the result of Theorem|6]with the help from the above lemmas. 
By union bound. 


P [Algorithms fail] < P [Algorithm 1 fails] + P [Algorithm 2 fails [Algorithm 1 succeeds]. (51) 

For Algorithmic consider directly Lemma |7] The algorithm performs at most A rounds, which is 
upper bounded by dminCt/a- In each round, the algorithm checks the conditional covariance over 
all unselected nodes, and if one has a larger discrepancy between and Tij^s^ then e, then an 

error occurs. Thus, it can be obtained by union bound, that 


P [Algorithms fail] < 


ndr, 


-P 


■ 


■ 

max 


> e 

.5cy,|S|<r/ 


_ 


< 


Ad, 


■ min ^ 77+3 

- n '^ exp 




Cl 


(52) 


where 


V = 




{adJaxidl,^^ - a^) 


is the upper bound for 15^1. 


e)2 


(53) 
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Since the upper bound does not depend on the graph size, there exist a constant C 3 = p + S + Cai, for 
which if iV = C 3 logn, then P[Algorithm 1 fails] < C 41 exp(—C 51 -/V), where C 41 = AdminOi/a 
and C 51 = C 31 IC 3 . 

Next, consider the backward pruning algorithm. The algorithm fails if one of the non-neighbors of 
node i has corresponding entry in T that’s greater than e = or when the an actual neighbor of 
node i has its corresponding entry in T smaller than tP. Hence, by Lemma0 we have 

P [Algorithm 2 fails I Algorithm 1 succeeds] < (^42 exp(—(752^^), (54) 


for any N, where C 42 = 4, and 

_ . (NCh^ NC2{a-ef \ 

52 mm 13200M2 ’ 3200M2 J ' 

Hence, the theorem holds true by letting C 3 = rj + ^ + C 31 , C 4 = 
min{C5i,C'52}. 


(55) 

max{C'4i, ( 742 }, and C 5 = 


7.12 Proof of Proposition |3] 

The left part of the condition makes sure that the graph is not empty. Denote the adjacency matrix 
of the graph as A, then, we have 

|||i?|||2<;^||7l||<^A<a. (56) 

^min ^min 

Since for any \R\ij ^ 0, \R\ij < ^ , the first inequality follows directly from Gelfand formula. 

The second step holds according to"|2, which upper bounds the spectral radius of a graph’s adja¬ 
cency matrix by its degree upper bound. 

Since there exists J matrix of arbitrary dimension n that satisfies the constraints imposed by the 
parameters a, b, a, A, where A <f^ the size of the graph can thus be arbitrarily large for any set 
of parameters satisfying the specified condition in this proposition. In addition, the degree upper 
bound A can be arbitrarily large as well, if a and b are small. 


7.13 Proof of Lemma |8] 


For any S containing m elements, assume Es.S = ^S,S + F, and £5 5 = ^ + E. Then 

{Es^s + E){Es,s + F) = I, 


indicating that 

E =-^g^sF{Es,s + F)-\ 

Consider the matrix norm jljSs gjlj := mmax^ jes |S„j (page 342, KT)). Then 


llli^lll < 




|(Ss,s -I- Ey 


(57) 

(58) 


(59) 


implying that for some constant (7, 

max \Eij\ < Cmy max \Eij\. (60) 


Now let S = Si- Combining with the boundedness of 'EiSi , there exists a bounded 

constant C 2 , such that when jE^ — E^j < e with high probability, [Eis^E^.g. — Eis^E^.g. | < 
(72m^£ with high probability. Hence the result follows by applying Lemma|6] 
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